
The repository link contains a README that gives an overview of the files and the structure of the data. For LLAMA and GPT2, the files are named human_{llm_name}{i}.jsonl, where {llm_name} is the name of the LLM and {i} is the partition index; the partitions can be concatenated to form the full dataset for that LLM.
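The concatenation described above could be sketched as follows. This is a minimal illustration, not the repository's own tooling: the function names `partition_index` and `concatenate_partitions`, the directory layout, and the assumption that {i} is a plain integer suffix are all hypothetical, so check the README for the actual conventions.

```python
import re
from pathlib import Path


def partition_index(path: Path, llm_name: str) -> int:
    """Return the integer suffix {i} of a human_{llm_name}{i}.jsonl file name.

    Assumption: {i} is a plain non-negative integer directly after the LLM name.
    """
    match = re.fullmatch(rf"human_{re.escape(llm_name)}(\d+)\.jsonl", path.name)
    if match is None:
        raise ValueError(f"unexpected file name: {path.name}")
    return int(match.group(1))


def concatenate_partitions(data_dir: str, llm_name: str, out_path: str) -> int:
    """Merge all partitions for one LLM into a single JSONL file.

    Returns the number of non-empty lines written.
    """
    pattern = re.compile(rf"human_{re.escape(llm_name)}\d+\.jsonl")
    parts = [p for p in Path(data_dir).iterdir() if pattern.fullmatch(p.name)]
    # Sort numerically so that partition 10 comes after partition 2.
    parts.sort(key=lambda p: partition_index(p, llm_name))

    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for part in parts:
            with open(part, encoding="utf-8") as f:
                for line in f:
                    line = line.rstrip("\n")
                    if line:  # skip blank lines between records
                        out.write(line + "\n")
                        written += 1
    return written
```

For example, `concatenate_partitions("data", "gpt2", "human_gpt2_full.jsonl")` would merge every file in a hypothetical `data` directory whose name matches `human_gpt2{i}.jsonl`. Sorting on the numeric suffix rather than on the file name avoids the lexicographic ordering in which partition 10 would precede partition 2.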

Using Authorship Verification to Mitigate Abuse in Online Communities

Weerasinghe, J.; Singh, R.; Greenstadt, R. (May 2022, Proceedings of the International AAAI Conference on Weblogs and Social Media)

Social media has become an important method for information sharing. This has also created opportunities for bad actors to easily spread disinformation and manipulate public opinion. This paper explores the possibility of applying Authorship Verification to online communities to mitigate abuse by analyzing the writing style of online accounts to identify accounts managed by the same person. We expand on our similarity-based authorship verification approach, previously applied to large fanfictions, and show that it works in open-world settings and on shorter documents, and is largely topic-agnostic. Our expanded model can link Reddit accounts based on the writing style of only 40 comments with an AUC of 0.95, and the performance increases to 0.98 given more content. We apply this model to a set of suspicious Reddit accounts associated with the disinformation campaign surrounding the 2016 U.S. presidential election and show that the writing style of these accounts is inconsistent, indicating that each account was likely maintained by multiple individuals. We also apply this model to Reddit user accounts that commented on the WallStreetBets subreddit around the 2021 GameStop short squeeze and show that a number of account pairs share very similar writing styles. Finally, we show that this approach can link accounts across Reddit and Twitter with an AUC of 0.91 even when training data is very limited.

